Chicago Bulls Prospective Player Analysis
2019-20

1. Introduction:

This interactive website is the tangible report component of a reproducible data analysis project. The project is based on a fictitious task given to the data analytics team of the Chicago Bulls NBA organisation: provide a prospective player analysis report to help the General Manager rebuild the team for next season. The task required an assessment of potential players for the Chicago Bulls to sign or retain for the 2019-20 NBA season.


2. Report scenario:

This project is based around the “Moneyball” theory of using statistical analysis to gain greater insight into sporting performance. In this case, the analysis covers the selection/purchase of players from the 2018-19 NBA season who would help the Chicago Bulls improve on their past season result (finishing 13th in the Eastern Conference and 27th overall on Win-Loss ratio) in the upcoming 2019-20 NBA season.

The assigned task included the following:

  • The assessment of potential players to purchase or retain for the Chicago Bulls organisation for the 2019-20 NBA season.
  • Projection of expected results with selected players.
  • Selection of 5 players, one from each position
    • Center = C
    • Power Forward = PF
    • Small Forward = SF
    • Shooting Guard = SG
    • Point Guard = PG
  • Ensure purchase of the 5 players was within the allotted budget of $118 million.
  • The proposed purchases must allow enough budget to still field the other remaining players required for an NBA team (NBA teams are allowed 15 players total).

The use of statistics in sport is not a new phenomenon1 2, partly due to the influence of the likes of Bill James and John Hollinger3, who revolutionised the use of statistical analysis that is now common within sports like basketball, and in particular the North American basketball league, the NBA. John Hollinger created the all-in-one metric known as the Player Efficiency Rating (PER), which combines several variables covering both positive and negative outcomes (e.g. points, turnovers, free throw misses, personal fouls) into a single indicator of player performance that can be compared both between players and within a player over time.


Project aim:

The hypothesis for this project is based on the use of a combination of known analysis methods/variables to create a predictive equation to aid in the selection of appropriate players for the Chicago Bulls 19/20 season in the NBA.4

By selecting players that attained above specific values in the selected key metrics, it is hypothesized that an increase in points/min could be achieved, which is associated with an increased win percentage.

As such, the project aims to show that choosing players who contribute to increasing the team points per minute average would equate to increased team wins, with a goal of achieving greater than 42 wins/50% win percentage and progressing to playoffs5.

The purpose and problem that this method of analysis addresses is a way to see through the inflated market values for athletes and highlight the true value of players based on their repeated habits and trends of play. I believe that the predictive formula below can provide valuable insight into the real value and contribution players are making/could make in a new team.


Positions and key metrics used in the NBA

Basketball has 5 positions. Although each position has a defined role, there is some variation within the roles, and it is common for players to play across two positions, depending on the team and its other members.

Positions and key roles to look at:

  • PF = Power Forward
    • Offence = Playing near to the basket, rebounding.
      • Does have shooting role - 2P > 3P
    • Defence = Defending taller players and rebounds.
  • C = Center
    • Offence = Tries to score on close shots and gather offensive rebounds.
      • Predominantly 2P
    • Defence = Tries to block opponents’ shots and rebound their misses.
  • PG = Point Guard
    • Offence = Runs plays, shooter, passer, dribble.
      • Good shooter 3P > 2P
    • Defence = Defensively looks to steal from opposing PG
  • SG = Shooting Guard
    • Offence = Predominantly a shooter, dribbler and passer 3P
    • Defence = steals and blocks
  • SF = Small Forward
    • Offence = Plays within the key - Shoots regularly - close and far. Universal player
    • Defence = Universal role

As the game of basketball has evolved, so have the tools used to measure a player’s performance. The list of usable metrics recorded from a standard NBA game is long, and each interpretable variable has had its time in the limelight.

Variables/Metrics targeted in the analysis:

The variables selected were used to show an association with an increase in overall win percentage due to an increase of points per minute played.

The variables used for the predictive value were:

  • Effective Field Goal Percentage (eFGp)
  • Trade Value (TrV)
  • Efficiency rate (EFF)
  • Usage Rate (Tm_use)
  • Total Rebounds per minute (TRB_MP)
  • Points per minute (PTS_per_MP)

Dean Oliver6 refers to the “Four Factors” of Basketball adding that metrics/ratings can be broken down into four elements of the game: shooting, turnovers, rebounding, and getting to the foul line. It is in this framework that I believe that using a multifaceted approach to the analysis of player performance decreases the disparity between observed results and predictive results.

Points per minute \[ \widehat{PTS\_per\_MP} = -0.382 + 0.699 * eFGp - 0.0330 * TRB\_MP + 2.39 * Tm\_use\_total + 0.00000965 * EFF - 0.00000803 * TrV \]


Justification and importance:

The previous 2018-19 season saw the Chicago Bulls finish 27th out of 30 teams in the NBA (on win-loss record). The Chicago Bulls organisation has aspirations to rebuild their line-up and field a team with championship title potential for the upcoming 2019-20 season.

Background on variables:

By balancing statistical variables such as usage rates, efficiency ratings and other varying offensive/defensive ratings of the five players on a basketball court, a team can achieve optimal offensive output. This can be seen through repeated game stats and team habits on the court7.

Interestingly, the trends show that, for all players, as a player uses more possessions, his efficiency decreases. In the eyes of some statistical analysts, what defines a superstar is someone who can carry a larger proportion of a team’s possessions and produce points with only a relatively small drop in efficiency. Meanwhile, the opposite is also true: players perform more efficiently when they are asked to use fewer of their team’s possessions. As a result, the greater burden on the superstar means that supporting players maintain low usage rates, allowing them to operate closer to their peak efficiency.

In an effort to determine how much impact players have on their teams, sports statisticians have developed metrics such as Usage Percentage. Examining Usage Percentage gives us an indication of how efficient a player is given the amount of possessions he uses.8

What defines a quality player is someone who can have a high Usage Percentage, but still plays at a high rate of efficiency. Teams can look at the Usage Percentage of players on their team, and determine how to balance usage across their lineup to maximize team efficiency.9

As with other combination metrics within sport, the predictive formula may look complicated at first glance, but the basic idea is to look at a player’s combination of independent and dependent metrics and find the percentage of the team totals he uses in those same categories.3


Relevant calculations

The following calculations were used within the analysis process and are referred to regularly throughout this report.

Usage rate equation (TM_use_total)


  • Usage Rate: Usage rate (or usage percentage) is an estimate of the percentage of team plays used by a player while he was on the floor. The basis of the formula is to look at a player’s combination of field goal attempts, free throw attempts and turnovers, and find the percentage of the team totals he uses in those same categories.

    It is calculated by:
    \[ Usage(\%) = 100* \frac{((FGA+0.44*FTA+TOV)*(TM\_MP/5))}{(MP*(TM\_FGA+0.44*TM\_FTA+TM\_TOV))} \]
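As a minimal sketch of the usage rate formula above, the base R function below is a direct transcription of it. The function name `usage_rate` and the input numbers are illustrative assumptions and are not taken from the project's scripts or data.

```r
# Usage rate, transcribed from the formula above.
# Player totals: FGA, FTA, TOV, MP. Team totals: TM_FGA, TM_FTA, TM_TOV, TM_MP.
usage_rate <- function(FGA, FTA, TOV, MP, TM_FGA, TM_FTA, TM_TOV, TM_MP) {
  player_plays <- FGA + 0.44 * FTA + TOV
  team_plays   <- TM_FGA + 0.44 * TM_FTA + TM_TOV
  100 * (player_plays * (TM_MP / 5)) / (MP * team_plays)
}

# Made-up season totals, for illustration only.
usage_rate(FGA = 1000, FTA = 300, TOV = 150, MP = 2400,
           TM_FGA = 7000, TM_FTA = 1900, TM_TOV = 1100, TM_MP = 19800)
```

The `TM_MP / 5` term converts total team minutes into the minutes available to one of the five court positions, so a player who uses exactly one fifth of the team's plays while he is on the floor scores 20.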

Effective field goal percentage (eFGp)


  • Effective Field Goal Percentage (eFGp): A statistic that adjusts field goal percentage to account for the fact that three-point field goals count for three points while two-point field goals only count for two points. Its purpose is to equalise the field goal output percentage between two-point shooters and three-point shooters.

    It is calculated by: \[ eFG(\%) =\frac{FG+(0.5*3P)}{FGA} \]
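The adjustment can be illustrated with a short base R sketch. The function name `efg` and the shooting totals are hypothetical. The argument is called `X3P` because `read.csv()` renames a column called `3P` that way by default.

```r
# Effective field goal percentage, transcribed from the formula above.
efg <- function(FG, X3P, FGA) {
  (FG + 0.5 * X3P) / FGA
}

# Hypothetical shooter: 400 made field goals (100 of them threes) on 1000 attempts.
fg_pct  <- 400 / 1000                                # raw field goal percentage: 0.40
efg_pct <- efg(FG = 400, X3P = 100, FGA = 1000)      # effective percentage: 0.45
c(fg_pct = fg_pct, efg_pct = efg_pct)
```

Each made three-pointer is credited as 1.5 made field goals, which is why the effective value sits above the raw percentage for any player who makes threes.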

Efficiency Value


  • Efficiency Value is a metric invented by Martin Manley and is considered the first ever player evaluation metric. It indicates a player’s linear efficiency.10

    It is calculated by: \[ EFF = \frac{(PTS + TRB + AST + STL + BLK - (FGA-FG) - (FTA-FT) - TOV)}{GP} \]
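A minimal base R sketch of the efficiency formula above follows. The function name `eff` and the season totals are made up for illustration.

```r
# Efficiency value per game, transcribed from the formula above.
eff <- function(PTS, TRB, AST, STL, BLK, FGA, FG, FTA, FT, TOV, GP) {
  missed_fg <- FGA - FG   # missed field goals
  missed_ft <- FTA - FT   # missed free throws
  (PTS + TRB + AST + STL + BLK - missed_fg - missed_ft - TOV) / GP
}

# Hypothetical season totals over 80 games.
eff(PTS = 1600, TRB = 500, AST = 300, STL = 80, BLK = 40,
    FGA = 1200, FG = 580, FTA = 400, FT = 320, TOV = 180, GP = 80)
# 20.5
```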

Trade Value


  • Trade Value is an estimate that uses a player’s age and his approximate value to determine how much value a player has left in his career. Invented by Bill James.11 12

    It is calculated by: \[ TrV = \frac{[(AV Formula - 27-0.75*Age)^2(27-0.75*Age +1)*AV Formula]}{190}+(AV Formula)*2/13 \]

Approximate Value

  • Approximate Value is an estimate of a player’s value, making no fine distinctions, but, rather, distinguishing easily between very good seasons, average seasons, and poor seasons13.

    It is calculated by:
    \[ AV Formula = \frac{(Credits^{3/4})}{21} \]

Credits Formula

  • The Credits Formula is an aggregation of observations from a standard game/season, used within the approximate value calculation.

    It is calculated by:
    \[ Credits Formula = (PTS)+(TRB)+(AST)+(STL)+(BLK)-(FGA-FG)-(FTA-FT)-(TOV) \]
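The three formulas above chain together: Credits feed Approximate Value, which feeds Trade Value. The sketch below transcribes them literally into base R. The function names are hypothetical, and the grouping of terms in `trade_value` is an assumption that follows the formula exactly as it is written in this report.

```r
# Credits: aggregate of box-score totals, as in the Credits Formula above.
credits <- function(PTS, TRB, AST, STL, BLK, FGA, FG, FTA, FT, TOV) {
  PTS + TRB + AST + STL + BLK - (FGA - FG) - (FTA - FT) - TOV
}

# Approximate Value: Credits^(3/4) / 21.
approx_value <- function(credits) {
  credits^(3 / 4) / 21
}

# Trade Value: literal reading of the formula as printed above (an assumption).
trade_value <- function(av, age) {
  ((av - 27 - 0.75 * age)^2 * (27 - 0.75 * age + 1) * av) / 190 + av * 2 / 13
}

# Hypothetical player, aged 25, with made-up season totals.
cr <- credits(PTS = 1600, TRB = 500, AST = 300, STL = 80, BLK = 40,
              FGA = 1200, FG = 580, FTA = 400, FT = 320, TOV = 180)
av <- approx_value(cr)
trade_value(av, age = 25)
```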

Total rebounds/minute played (TRB_MP)


  • The calculation of total rebounds per minute is simple in nature. It captures active offensive and defensive rebound involvement at a per-minute-played rate so that players with varying levels of game time can be compared. Influential work on statistical analysis within basketball references the importance of rebounding as one of the “Four Factors.”

    It is calculated by:
    \[ TRB\_MP = \frac{(ORB+DRB)}{MP} \]

Points/minute played (PTS_per_MP)


  • Points per minute was included in the analysis to accurately compare points across players. Per-minute rates were also calculated for other metrics including steals, blocks, assists and turnovers, by taking the player’s total in the relevant metric and dividing by the total minutes played.

    It is calculated by:
    \[ PTS\_per\_MP = \frac{PTS}{MP} \]
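Both per-minute variables follow the rule described in the text: take the season total and divide by minutes played. A minimal base R sketch, with a hypothetical helper `per_minute` and made-up totals:

```r
# Generic per-minute rate: season total divided by minutes played.
per_minute <- function(total, MP) {
  total / MP
}

# Hypothetical player: total rebounds are offensive plus defensive rebounds.
ORB <- 100; DRB <- 400; PTS <- 1500; MP <- 2500

TRB_MP     <- per_minute(ORB + DRB, MP)   # 0.2 rebounds per minute
PTS_per_MP <- per_minute(PTS, MP)         # 0.6 points per minute
c(TRB_MP = TRB_MP, PTS_per_MP = PTS_per_MP)
```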

Team Win % (WinP_TM)


  • To calculate winning percentage, the number of wins is divided by the number of games played. Team winning percentage was included in the model to explore the relationship between the individual player metrics and their contribution to a team’s winning percentage.

    It is calculated by: \[ Win\% = \frac{TM\_W}{TM\_G} \]
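A short base R sketch of the win percentage calculation, applied to the 82-game season referenced in this report. The function name `win_pct` is hypothetical, and multiplying by 100 is an assumption made so that the result sits on the same 0 to 100 scale as the WinP_Tm values in the regression output.

```r
# Team winning percentage on a 0-100 scale (scaling by 100 is an assumption).
win_pct <- function(TM_W, TM_G) {
  100 * TM_W / TM_G
}

win_pct(41, 82)   # 50: exactly half of an 82-game season
win_pct(42, 82)   # about 51.2: the 42-win mark named in the project aim
```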


3. Reading and cleaning the raw data

This section details the process undertaken for the reading, cleaning and exporting of tidy data frames for further analysis. For instructions on how to replicate this project, please see the hosted GitHub repo.

Reading and cleaning steps.

1. Data sources


2. Cleaning process


The following steps were carried out to ensure the data was clean and processable ready for analysis:

  1. Files in *.csv format were imported (“read”) into the R program.
  2. Distinguishable names were assigned to each data frame.
  3. Error checking and identification of missing values across all data frames.
  4. Conversion of NA values (implicit to explicit).
  5. Removal of any empty columns.
  6. Comparison across data frames for common variable names.
  7. Correction of spelling/abbreviations of player names to ensure accurate data matching across data sets.
  8. Correction of team names to match abbreviations across data sets.
  9. Checking for errors and missing values within the data sets.
  10. Merging of data frames into one data set to allow comparison of variables.
  11. Identification of duplicates and collapsing/aggregation of values to give one season value for each player.
  12. Correction of any class issues due to merging, i.e. numeric values are numeric etc.
  13. Creation of variables at a rate of minutes played.
  14. Creation of equations and new variables for predictive model analysis.
  15. Separation of players transferred within the 18/19 season, identified as TOT players.
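A few of the steps above (5, 13 and 15) can be sketched in base R on a toy data frame. The player names, values and object names below are invented for illustration; the project's own scripts are not shown in this report and may take a different approach.

```r
# Toy player table: "A Player" was traded, so he has a TOT row plus one row per team.
raw <- data.frame(
  player_name = c("A Player", "A Player", "A Player", "B Player"),
  Tm          = c("TOT", "CHI", "HOU", "CHI"),
  G           = c(60, 30, 30, 70),
  MP          = c(1500, 700, 800, 2100),
  PTS         = c(600, 250, 350, 1200),
  empty       = NA,
  stringsAsFactors = FALSE
)

# Step 5: drop columns that contain only missing values.
raw <- raw[, colSums(!is.na(raw)) > 0, drop = FALSE]

# Step 15: separate traded players (anyone with a TOT row) from the rest.
traded_names <- unique(raw$player_name[raw$Tm == "TOT"])
df_TOT    <- raw[raw$player_name %in% traded_names, ]
df_nonTOT <- raw[!(raw$player_name %in% traded_names), ]

# Step 13: create a per-minute variable.
df_nonTOT$PTS_per_MP <- df_nonTOT$PTS / df_nonTOT$MP
df_nonTOT
```

Separating by player name rather than by `Tm != "TOT"` keeps the per-team rows of a traded player out of the non-transfer group as well.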

3. Tidy data frames exported


  1. The data frames were exported to *.csv files into a separate folder in the working directory.
  2. Data frames were imported/read in from clean *.csv files for further analysis via R scripts in R.


4. Exploratory analysis:

Exploratory steps

1. NBA player data base

NBA player list for 2018-19 season

2. Variable distribution

  • Large distribution of variable values across the data sets.
  • Several outliers exist with high points/min (and other stats) due to minimal game time/games.
  • Some left-tailed skewness in distribution.
  • Identified thresholds of outlier high points of influence/leverage to test in

Box plot of Points/min per position

Team Win % vs Points/min

Histogram distribution of Effective field goal %

3. Variable relationships:

Relationship between EFF and eFGp


Relationship between Points/min and Team Win %

Relationship between eFGp % and Team Win %

Relationship between Points/min and EFF

4. Linear model assessments


Searching for confounding variables and single linear model:

Relationship between points per min and team winning percentage:

There appears to be a linear relationship between points/min and winning percentage. As points/min increases, there is an increase in team winning percentage.

5. Correlation co-efficient PTS_per_MP and WinP_Tm


The correlation coefficient = 0.056, indicating a very weak positive correlation between Points/min and team winning percentage, as the value is close to 0 rather than approaching 1.

Correlation coefficient
0.055602
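In a simple regression with a single predictor, the squared correlation coefficient equals the model's multiple R-squared. As a rough cross-check, the base R sketch below recovers the size of the correlation from the multiple R-squared of 0.003092 reported in the regression summary in this report. Only the reported figures are used, so the result is subject to rounding.

```r
# Multiple R-squared as printed in the simple linear regression summary.
r_squared <- 0.003092

# With one predictor, |r| = sqrt(R-squared).
r <- sqrt(r_squared)
round(r, 3)   # 0.056
```

A value this close to 0 means that points per minute on its own explains well under 1% of the variation in team winning percentage.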

6. Simple Linear Regression


term         estimate   std.error  statistic   p.value    conf.low   conf.high
(Intercept)  47.720858  2.850014   16.7440769  0.0000000  42.111684  53.33003
PTS_per_MP    5.853553  6.151282    0.9515989  0.3420875  -6.252917  17.96002

Call:
lm(formula = WinP_Tm ~ PTS_per_MP, data = df_nonTOT_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-30.1688  -9.5159   0.5934  10.4431  23.5914 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   47.721      2.850  16.744   <2e-16 ***
PTS_per_MP     5.854      6.151   0.952    0.342    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.24 on 292 degrees of freedom
Multiple R-squared:  0.003092,  Adjusted R-squared:  -0.0003225 
F-statistic: 0.9055 on 1 and 292 DF,  p-value: 0.3421

The intercept coefficient = 47.7, meaning that when points per minute is 0, the expected team winning percentage = 47.7, which does not make much practical sense, but is a starting point for the model. The slope coefficient = 5.85, meaning that for every 1 unit increase in points per minute, the expected team winning percentage increases by 5.85 percentage points; this slope is not statistically significant (p = 0.342). The multiple R-squared value = 0.003092 (adjusted R-squared = -0.0003225), meaning that about 0.31% of the variance in team winning percentage is explained by the variance in points per minute.
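The slope's test statistic, p-value and confidence interval can be reproduced from the estimate and standard error in the coefficient table. The base R sketch below uses only the reported figures. The points per minute value of 0.5 used for the prediction is a hypothetical input chosen for illustration.

```r
# Reported values for the PTS_per_MP slope and the intercept.
intercept <- 47.720858
slope     <- 5.853553
se_slope  <- 6.151282
df_resid  <- 292          # residual degrees of freedom from the summary

t_stat <- slope / se_slope                                   # about 0.952
p_val  <- 2 * pt(-abs(t_stat), df_resid)                     # about 0.342
ci     <- slope + c(-1, 1) * qt(0.975, df_resid) * se_slope  # about -6.25 to 17.96

# Predicted team winning percentage for a hypothetical 0.5 points per minute.
pred_win <- intercept + slope * 0.5                          # about 50.65

round(c(t = t_stat, p = p_val, low = ci[1], high = ci[2], pred = pred_win), 3)
```

Because the 95% interval runs from a negative to a positive slope, the data are compatible with no relationship between points per minute and winning percentage in this single-predictor model.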


Call:
lm(formula = WinP_Tm ~ PTS_per_MP, data = df_nonTOT_clean)

Coefficients:
(Intercept)   PTS_per_MP  
     47.721        5.854  

7. Independence


The Durbin-Watson statistic = 0.01404402, which is not close to the recommended value of 2, meaning that the assumption of independence has possibly been violated. However, this could be because the data set is filtered, the figures come from across teams, and there is player movement/transfer between teams, all of which could influence independence.

 lag Autocorrelation D-W Statistic p-value
   1       0.9832565    0.01404402       0
 Alternative hypothesis: rho != 0
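The Durbin-Watson statistic compares each residual with the one in the row before it, so it depends on the order of the rows. The base R sketch below defines the statistic by hand and uses simulated residuals to show how sorting alone can push it towards 0. Whether row ordering explains the value reported above is an assumption and is not confirmed by this report.

```r
# Durbin-Watson statistic: sum of squared successive differences over sum of squares.
durbin_watson <- function(e) {
  sum(diff(e)^2) / sum(e^2)
}

set.seed(1)
e <- rnorm(200)            # simulated, independent residuals

durbin_watson(e)           # close to 2 for independent residuals
durbin_watson(sort(e))     # close to 0 once the same values are sorted
```

If the player rows are grouped by team, players on the same team share one WinP_Tm value, which could produce a similar pattern of neighbouring residuals.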

8. Outlier identification and leverage points


Outliers

There do not appear to be any outliers, as all standardised residuals are less than 3.

Leverage points

There are no hat values greater than 1, however it will be useful to investigate the points above 0.025, as they appear to stand out from the rest of the values.

A need to investigate the points above 0.025.

There are 5 points that could be influencing the model (1, 13, 45, 104, 200). The next step is to determine whether these points could be considered high influence.

A need to investigate points above 0.015 that are standing out above the rest.

There are 11 points that could be influencing the model (1, 13, 14, 25, 45, 264, 269, 275, 283, 285, 288). This requires a further assessment of the model without the high-influence points.
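The leverage check described above can be sketched in base R with `hatvalues()`. The data below are simulated and the cutoff is a common rule of thumb (twice the number of coefficients divided by the number of observations), not the visually chosen thresholds of 0.025 and 0.015 used in this report. All object names are hypothetical.

```r
set.seed(42)

# Simulated data: 49 typical players plus one with an extreme points/min value.
toy <- data.frame(x = c(runif(49, 0.2, 0.6), 1.5))
toy$y <- 48 + 6 * toy$x + rnorm(50, sd = 10)

fit <- lm(y ~ x, data = toy)
h   <- hatvalues(fit)                    # leverage of each observation

# Rule-of-thumb cutoff: 2 * (number of coefficients) / n.
cutoff  <- 2 * length(coef(fit)) / nrow(toy)
flagged <- which(h > cutoff)

# Refit without the flagged rows and compare the coefficients.
refit <- lm(y ~ x, data = toy[-flagged, , drop = FALSE])
rbind(all_rows = coef(fit), without_flagged = coef(refit))
```

For a model with two coefficients fitted to 294 observations, the same rule gives a cutoff of about 0.014, which is close to the 0.015 threshold explored above.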

A re-run of the linear regression with filtered_df

# A tibble: 2 x 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)    47.7       2.85    16.7   1.41e-44    42.1       53.3
2 PTS_per_MP      5.85      6.15     0.952 3.42e- 1    -6.25      18.0

Call:
lm(formula = WinP_Tm ~ PTS_per_MP, data = df_LinR_filtered)

Residuals:
     Min       1Q   Median       3Q      Max 
-30.1688  -9.5159   0.5934  10.4431  23.5914 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   47.721      2.850  16.744   <2e-16 ***
PTS_per_MP     5.854      6.151   0.952    0.342    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.24 on 292 degrees of freedom
Multiple R-squared:  0.003092,  Adjusted R-squared:  -0.0003225 
F-statistic: 0.9055 on 1 and 292 DF,  p-value: 0.3421

The removal of high influence points does not show any change to the intercept or to the PTS_per_MP coefficient.

However, visually, the plot without high influence points looks a lot cleaner and more linear.

The graph shows a more even spread, and is less influenced by the high points.

9. Test for homoscedasticity within data set


Plotting the residuals against the fitted values suggests that the assumption of homoscedasticity is upheld. As such, there does not appear to be evidence of heteroscedasticity.

10. Assessment of normality

Are the residuals normally distributed?

There appears to be some slight skewness, and the residuals do not look evenly distributed. This is likely from the points investigated for influence. These values did not appear to be influencing the results of the model. The left skewed tail could be due to the large spread in players’ points scoring. A possible option is to collect more data, and there are potentially other factors that contribute to winning.

This simple linear regression shows only a weak positive association between PTS_per_MP and WinP_Tm in the NBA, and the slope is not statistically significant (p = 0.342). Further analysis of the variables eFGp, EFF, TRB_MP, TrV, and Tm_use_total is proposed to assess their influence on PTS_per_MP, and therefore WinP_Tm, in a multiple linear regression. Most assumptions have been satisfied, with some understanding of the bias of the data set (the independence assumption being the exception), and a multiple linear regression appears to be a suitable statistical test to investigate the correlations in this data set. The decision to filter to 40 games for this linear regression was made to see what the most consistent players in the NBA were scoring, and how that influenced the Win %. This is important as we want players with a high team usage factor, and thus high scoring, to be influential at the Chicago Bulls.

11. Linear model assessment

How many more Points/min can be attained when controlling for other factors

term          estimate    std.error  statistic    p.value    conf.low    conf.high
(Intercept)   -0.3821332  0.0180836  -21.1315032  0.0000000  -0.4177259  -0.3465405
eFGp           0.6985608  0.0314831   22.1884530  0.0000000   0.6365947   0.7605269
TRB_MP        -0.0329833  0.0174981   -1.8849604  0.0604419  -0.0674237   0.0014572
Tm_use_total   2.3855803  0.0366744   65.0474831  0.0000000   2.3133963   2.4577642
EFF            0.0000096  0.0000043    2.2488204  0.0252797   0.0000012   0.0000181
TrV           -0.0000080  0.0000104   -0.7740553  0.4395330  -0.0000284   0.0000124

The table above shows that for every 1 unit increase in eFGp, PTS_per_MP will increase by 0.699. This is not exactly a practical example, as it could equate to a player shooting at greater than 100% for their effective field goal percentage and would then surpass the highest point scorer/min at 0.98. As such, this number should be viewed as a rate indicator: if you could improve your team’s eFGp across the board by 20 percentage points (0.20), you would see an increase of about 0.14 in points/min (0.699 * 0.20).

# A tibble: 6 x 7
  term            estimate  std.error statistic   p.value    conf.low  conf.high
  <chr>              <dbl>      <dbl>     <dbl>     <dbl>       <dbl>      <dbl>
1 (Intercept)  -0.382      0.0181       -21.1   1.68e- 60 -0.418      -0.347    
2 eFGp          0.699      0.0315        22.2   2.73e- 64  0.637       0.761    
3 TRB_MP       -0.0330     0.0175        -1.88  6.04e-  2 -0.0674      0.00146  
4 Tm_use_total  2.39       0.0367        65.0   3.24e-174  2.31        2.46     
5 EFF           0.00000965 0.00000429     2.25  2.53e-  2  0.00000120  0.0000181
6 TrV          -0.00000803 0.0000104     -0.774 4.40e-  1 -0.0000284   0.0000124

12. Justification for data modelling


In an effort to determine how much impact players have on their teams, the use of multiple developed metrics such as Usage Percentage, Efficiency Rate, Effective Field Goal Percentage and Trade Value, combined with hard per-minute data, i.e. Points per minute and Total Rebounds per minute, gives a well rounded assessment of a player’s value and future worth to an organisation7.

By selecting players that featured in greater than 40 games in the season and above select values in the selected key metrics, it is hypothesized that an increase in points/min could be achieved, which is associated with an increased win percentage.

As such, choosing players who contribute to increasing the team points per minute average would equate to increased team wins, with a goal of achieving greater than 42 wins or 50% win percentage and progressing to playoffs.

Players who were transferred during the season were excluded from the analysis and predictive model. This decision was made because, to find the best value players, it was important to see which players contributed regularly to the overall team Win % and were consistently playing in the NBA.
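The selection rule described above (more than 40 games, no mid-season transfer) can be sketched in base R with `subset()`. The toy data frame and its values are invented for illustration, and the strict `G > 40` comparison is an assumption about how the 40-game threshold was applied.

```r
# Toy player table; "Player B" was traded and so carries a TOT row.
players <- data.frame(
  player_name = c("Player A", "Player B", "Player B", "Player C"),
  Tm          = c("CHI", "TOT", "HOU", "HOU"),
  G           = c(72, 65, 45, 12),
  stringsAsFactors = FALSE
)

# Anyone with a TOT row is treated as a transfer player.
traded_names <- unique(players$player_name[players$Tm == "TOT"])

# Keep regular contributors who stayed with one team all season.
eligible <- subset(players, G > 40 & !(player_name %in% traded_names))
eligible$player_name   # "Player A"
```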

Players with greater than 40 NBA games in 18/19 and who are non-transfer players:


5. Data modelling:

This section covers:

  • Player data analysis:
  • Data modeling
    • Creating a multiple linear regression model,
  • Assumption checking
  • Model output and interpretation

The table below shows the non transfer NBA player group filtered to the select variables chosen for our analysis/model.

The selected variables are:

  • Position
  • Salary
  • Age
  • Team
  • Usage %
  • EFF Rate
  • Trade Value
  • eFG %
  • Points/min
  • Expected points/min
  • Total rebounds/min
  • Team Win %


NBA player group


Multiple regression

Assumption testing


Added-Variable Plots

Pairs Plots

Multicollinearity occurs when two or more of your explanatory variables are highly correlated with each other. It can lead to changes in the coefficient estimates and confusion around which variable is explaining the variance in the response variable. As such, multicollinearity undermines the statistical significance of an independent variable.

Below is a visual test for multicollinearity using a pairs plot.

Variance inflation factor

Variance Inflation Factor18 was assessed to identify the correlation between predictors (i.e. independent variables) in a model; its presence can adversely affect your regression results. The VIF estimates how much the variance of a regression coefficient is inflated due to the variables being too alike/related to each other.

Variance inflation factor
eFGp 1.428056
TRB_MP 1.265889
Tm_use_total 2.282247
EFF 3.430486
TrV 1.844460

As all of our values are between 1 and 5, it is safe to say that there is some correlation between the predictors.

The predictors used showed some “moderate correlation” when tested for multicollinearity. This is explainable due to the aggregated nature of some of the statistics used, i.e. similar variables were used across and within each metric.
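The variance inflation factor for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all of the other predictors. The base R sketch below computes it by hand on simulated data. The helper `vif_manual` and the variables `x1` to `x3` are hypothetical, and this is not necessarily the function that produced the table above.

```r
# VIF for every column of a data frame of predictors.
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    # Regress predictor v on all remaining predictors.
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

set.seed(7)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100)   # built from x1, so the two are correlated
x3 <- rnorm(100)        # unrelated to the others

vif_values <- vif_manual(data.frame(x1, x2, x3))
vif_values              # x1 and x2 are inflated, x3 stays near 1
```

A VIF of 1 means a predictor is uncorrelated with the others, and values grow as the predictor becomes easier to reconstruct from the rest.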

Square root of VIF:

The square root of the VIF indicates how many times larger the standard error is compared to the scenario in which that variable had 0 correlation with the other predictors. From the table below, all values are between 1.1 and 1.9, indicating a modest inflation of the standard errors. One way to reduce the standard errors is to obtain more data across multiple seasons, which will produce more precise coefficient estimates.

Square root of VIF
eFGp 1.195013
TRB_MP 1.125117
Tm_use_total 1.510711
EFF 1.852157
TrV 1.358109
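The values in this table follow directly from the variance inflation factors reported earlier, as this short base R check shows. Only the reported figures are used.

```r
# Variance inflation factors as reported in this section.
vif <- c(eFGp = 1.428056, TRB_MP = 1.265889, Tm_use_total = 2.282247,
         EFF = 3.430486, TrV = 1.844460)

# Multiplier applied to each coefficient's standard error.
se_inflation <- sqrt(vif)
round(se_inflation, 6)
```

For example, the standard error of the EFF coefficient is about 1.85 times what it would be if EFF were uncorrelated with the other predictors.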

Model output and interpretation


Linear regression and assessment of fit shows the comparison between the predicted/expected values and the actual season observed values. Points above the line = under estimated, below the line = over estimated.

Assessment of fit



Assessment of fit for Player Points/min

Comparison of Actual vs Expected Team Points/min vs Win (%)

Predictive formula for Points/minute

Predictive model for line of best fit

term          estimate    std.error  statistic    p.value    conf.low    conf.high
(Intercept)   -0.3821332  0.0180836  -21.1315032  0.0000000  -0.4177259  -0.3465405
eFGp           0.6985608  0.0314831   22.1884530  0.0000000   0.6365947   0.7605269
TRB_MP        -0.0329833  0.0174981   -1.8849604  0.0604419  -0.0674237   0.0014572
Tm_use_total   2.3855803  0.0366744   65.0474831  0.0000000   2.3133963   2.4577642
EFF            0.0000096  0.0000043    2.2488204  0.0252797   0.0000012   0.0000181
TrV           -0.0000080  0.0000104   -0.7740553  0.4395330  -0.0000284   0.0000124

A predictive formula based off multiple regression model:

Using the approximate values drawn from the exploratory analysis below:

  • eFGp = 0.55
  • TRB_MP = 0.2
  • Tm_use_total = 0.2
  • EFF = 1500
  • TrV = 600

\[ \widehat{PTS\_per\_MP} = -0.382 + 0.699 * 0.55 - 0.0330 * 0.2 + 2.39 * 0.20 + 0.00000965 * 1500 - 0.00000803 * 600 \]

Points/min
0.483507

The figures selected above were taken from the exploratory analysis and gauged on a worst case scenario of player recruitment, with the minimum goal of achieving a points/min ratio that would equate to a Win % greater than 50.
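The worked example above can be reproduced in base R using the rounded coefficients from the predictive formula and the five input values listed. The object names are hypothetical.

```r
# Rounded coefficients from the predictive formula.
coefs <- c(intercept = -0.382, eFGp = 0.699, TRB_MP = -0.0330,
           Tm_use_total = 2.39, EFF = 0.00000965, TrV = -0.00000803)

# Worst case input values listed above.
player <- c(eFGp = 0.55, TRB_MP = 0.2, Tm_use_total = 0.2, EFF = 1500, TrV = 600)

# Intercept plus the sum of coefficient * value for each predictor.
pred_pts_per_min <- coefs[["intercept"]] + sum(coefs[names(player)] * player)
round(pred_pts_per_min, 6)   # 0.483507
```

Swapping in another player's values for `player` gives that player's expected points per minute under the same model.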



6. Results:


The predictive model hypothesized and created above showed an increasing association throughout the model building process. Albeit not a direct correlation, there is evidence of a repeated positive association in predicting a points/minute rate in NBA players/games. The graph below shows an analysis of expected points and actual observed points compared to the Win % of each team. It also shows the projected 50% Win line, highlighting the teams who consistently produced winning results and their respective selected variables.

Comparison of Actual vs Expected Team Points/min vs Win (%)

Actual observed points = Black, Expected/Predicted = Blue


7. Player Analysis and Recommendations:

Below is an analysis of teams’ winning percentages and their corresponding points/minute rates. Given Chicago’s finishing position (bottom left corner of graph below) in both the conference and its overall Win-Loss ratio, the task of providing recommendations for 5 new starting players is essential.


The graphs below are an interactive representation of the players and their respective metrics utilised within our predictive model.



Points per/min vs Win percentage:



Player vs Salary analysis:




Trade Value vs Salary:



Efficiency rate vs Points/min:



Player pool for selection.



The players presented below, both individually and collectively, are expected to increase the Chicago Bulls points per minute rate and in doing so give the organisation the best chance to consistently play at above a 50% Win rate and progress through to playoffs.

8. Limitations:

Although the majority of assumptions were satisfied within acceptable levels, there are inherent biases within this project/model. Several teams achieved greater than predicted scoring ratios throughout their season, highlighting that there are elements of game play that have not been accurately recorded/reported. In combination with injuries/illnesses that may affect the actual starting line-up of a team, these factors, amongst others, may contribute to the variation in results.

There is inherent bias present within the predictive model; utilising explanatory variables that demonstrate correlation carries with it a dependency on the execution of said trend. For example, a player who is out of favour with a coach and not seeing game time cannot perform (as with the transfer players), and a player who depends on another player delivering him the ball in his key position will also affect the modelling and analysis.19

An element of survivorship bias is present, as the NBA hosts the best of the best players in the world, and using numeric trends to separate them could highlight a lack of independence of data/variables.

Players who played multiple positions and for multiple teams during the 2018/2019 season were excluded from the analysis. The data set was filtered to display players who had played for 40 or more games, which is just under 50% of the games for the season, as it was evident that the better performing players in each position played the vast majority of the 82 games of the season.

Lastly, the inclusion of some team specific factors within metrics could influence the perception of the individual player’s performance. For example, a really good player whose usage rating is high may have a lower efficiency or poor eFG % due to lack of possession of the ball in scoring opportunities20. Conversely, you may have an average player in a very good team. An attempt was made to address this limitation by bringing the statistics down to a per-minute-played ratio, to aid in the reduction of bias.




9. Summary:

This project highlighted several trends within the NBA data and the NBA overall standings results. This mode of retrospective/prospective analysis still relies on the game based execution of set actions/reactions. This can be seen within the confidence intervals within each predictive variable, showing the margin for difference between expected and observed.

As mentioned before, Dean Oliver refers to the “Four Factors” of Basketball adding that metrics/ratings can be broken down into four elements of the game:

  • Shooting
  • Turnovers
  • Rebounding, and
  • Getting to the foul line

These four elements, or “Four Factors,” allow a strategic framework of understanding to be extracted from the game.

It is in this framework that I believe that using a multifaceted approach to the analysis of player performance decreases the disparity between observed results and predictive results.

The purpose and problem that this method of analysis addresses is a way to see through the inflated market values for athletes and highlight the true value of players based on their repeated habits and trends of play. I believe that the predictive formula created in this analysis can provide valuable insight into the real value and contribution players are making/could make in a new team.



10. Glossary:

NBA standard terms:

Project specific:

  • Pos = Position
  • Tm = Team, abbreviated to three letters, e.g. Chicago = CHI, Houston = HOU
  • ‘…_MP’ = Statistic at a per-minute rate
  • ‘TM_…’ = Statistic as a team total
  • Tm_use_total = Usage Rate, the total use by the team as a percentage across the total number of minutes played
  • TrV = Trade Value, an estimate of athlete value that takes into account athlete age and game-based statistics
  • eFGp = Effective Field Goal Percentage, which allows comparison of 2-point and 3-point shooters
  • EFF = Efficiency Value, an indicator of an athlete's linear efficiency
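For reference, the linear efficiency (EFF) credited to Martin Manley is commonly written as points, rebounds, assists, steals and blocks, minus missed field goals, missed free throws and turnovers. The Python sketch below computes it from season totals. It is an assumption that the project used this unscaled form; the published version also divides by games played, and the sample values are hypothetical.

```python
def efficiency(row):
    """Linear efficiency from season totals (no games-played scaling)."""
    missed_fg = row["FGA"] - row["FG"]
    missed_ft = row["FTA"] - row["FT"]
    positives = row["PTS"] + row["TRB"] + row["AST"] + row["STL"] + row["BLK"]
    negatives = missed_fg + missed_ft + row["TOV"]
    return positives - negatives


# Hypothetical season totals for illustration only.
player = {
    "PTS": 1000, "TRB": 400, "AST": 300, "STL": 80, "BLK": 40,
    "FG": 380, "FGA": 800, "FT": 200, "FTA": 250, "TOV": 150, "MP": 2000,
}
eff_total = efficiency(player)
eff_per_minute = eff_total / player["MP"]  # per-minute form, as in '…_MP'
print(eff_total, eff_per_minute)
```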

Data frame specific

2018-19_nba_player-statistics.csv

This data file provides total statistics for individual NBA players during the 2018-19 season.
The variables consist of:

  • player_name : Player Name
  • Pos : (PG = point guard, SG = shooting guard, SF = small forward, PF = power forward, C = center)
  • Age : Age of Player at the start of February 1st of that season.
  • Tm : Team
  • G : Games
  • GS : Games Started
  • MP : Minutes Played
  • FG : Field Goals
  • FGA : Field Goal Attempts
  • FG% : Field Goal Percentage
  • 3P : 3-Point Field Goals
  • 3PA : 3-Point Field Goal Attempts
  • 3P% : FG% on 3-Pt FGAs
  • 2P : 2-Point Field Goals
  • 2PA : 2-point Field Goal Attempts
  • 2P% : FG% on 2-Pt FGAs
  • eFG% : Effective Field Goal Percentage
  • FT : Free Throws
  • FTA : Free Throw Attempts
  • FT% : Free Throw Percentage
  • ORB : Offensive Rebounds
  • DRB : Defensive Rebounds
  • TRB : Total Rebounds
  • AST : Assists
  • STL : Steals
  • BLK : Blocks
  • TOV : Turnovers
  • PF : Personal Fouls
  • PTS : Points

    • NB: Players that were traded during the season may appear more than once (on more than one row) so it is important to handle these duplications appropriately.

2018-19_nba_player-salaries.csv

This data file contains the salary for individual players during the 2018-19 NBA season.
The variables consist of:

  • player_id : unique player identification number
  • player_name : player name
  • salary : annual salary in $USD

2019-20_nba_team-payroll.csv

This data file contains the team payroll budget for the 2019-20 NBA season.
The variables consist of:

  • team_id : unique team identification number
  • team : team name
  • salary : team payroll budget in 2019-20 in $USD

2018-19_nba_team-statistics_1.csv

This data file contains miscellaneous team statistics for the 2018-19 season.

The variables consist of:

  • Rk : Rank
  • Age : Mean Age of Player at the start of February 1st of that season.
  • W : Wins
  • L : Losses
  • PW : Pythagorean wins, i.e., expected wins based on points scored and allowed
  • PL : Pythagorean losses, i.e., expected losses based on points scored and allowed
  • MOV : Margin of Victory
  • SOS : Strength of Schedule; a rating of strength of schedule. The rating is denominated in points above/below average, where zero is average.
  • SRS : Simple Rating System; a team rating that takes into account average point differential and strength of schedule. The rating is denominated in points above/below average, where zero is average.
  • ORtg : Offensive Rating; An estimate of points produced (players) or scored (teams) per 100 possessions
  • DRtg : Defensive Rating; An estimate of points allowed per 100 possessions
  • NRtg : Net Rating; an estimate of point differential per 100 possessions.
  • Pace : Pace Factor: An estimate of possessions per 48 minutes
  • FTr : Free Throw Attempt Rate; Number of FT Attempts Per FG Attempt
  • 3PAr : 3-Point Attempt Rate; Percentage of FG Attempts from 3-Point Range
  • TS% : True Shooting Percentage; A measure of shooting efficiency that takes into account 2-point field goals, 3-point field goals, and free throws.
  • eFG% : Effective Field Goal Percentage; This statistic adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal.
  • TOV% : Turnover Percentage; An estimate of turnovers committed per 100 plays.
  • ORB% : Offensive Rebound Percentage; An estimate of the percentage of available offensive rebounds a player grabbed while he was on the floor.
  • FT/FGA : Free Throws Per Field Goal Attempt
  • DRB% : Defensive Rebound Percentage
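Two of the definitions above, Pythagorean wins (PW) and True Shooting Percentage (TS%), can be made concrete with a short Python sketch. The exponent of 14 and the 0.44 free-throw weight are the values commonly used for NBA data and are assumptions here; the source files may have been built with different constants.

```python
def pythagorean_wins(points_for, points_against, games=82, exponent=14):
    """Expected wins from points scored and allowed (assumed exponent of 14)."""
    scored = points_for ** exponent
    allowed = points_against ** exponent
    return games * scored / (scored + allowed)


def true_shooting(pts, fga, fta):
    """Points per two 'true' shooting attempts (0.44 weights free throw trips)."""
    return pts / (2 * (fga + 0.44 * fta))


# Hypothetical inputs for illustration only.
print(pythagorean_wins(100, 100))  # equal scoring gives half of 82 games
print(round(true_shooting(1000, 800, 250), 3))
```

A team that scores exactly as much as it concedes is expected to win half its games, and PL is simply the number of games minus PW.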

2018-19_nba_team-statistics_2.csv

This data file contains total team statistics for the 2018-19 NBA season.

The variables consist of:

  • Team : Team name
  • Rk : Ranking
  • MP : Minutes Played
  • G : Games
  • FG : Field Goals
  • FGA : Field Goal Attempts
  • FG% : Field Goal Percentage
  • 3P : 3-Point Field Goals
  • 3PA : 3-Point Field Goal Attempts
  • 3P% : FG% on 3-Pt FGAs
  • 2P : 2-Point Field Goals
  • 2PA : 2-point Field Goal Attempts
  • 2P% : FG% on 2-Pt FGAs
  • FT : Free Throws
  • FTA : Free Throw Attempts
  • FT% : Free Throw Percentage
  • ORB : Offensive Rebounds
  • DRB : Defensive Rebounds
  • TRB : Total Rebounds
  • AST : Assists
  • STL : Steals
  • BLK : Blocks
  • TOV : Turnovers
  • PF : Personal Fouls
  • PTS : Points



Dr. Jocelyn Mara: Data Analysis in Sport PG21, University of Canberra, 2021
Martin Manley: Kansas City sports reporter and statistician, EFF calculation.
Bill James: Statistician, Trade Value calculation, Approximate Value calculation, Credits Calculation.
John Hollinger: Effective Field Goal percentage and Usage Rate calculation
Dean Oliver: Effective Field Goal percentage and Usage Rate calculation
Basketball-reference.com
Chicago Bulls Logo



This project was designed and built through RStudio, Version 1.4.1103, © 2009-2021 RStudio, PBC



11. References:

1.
Vaz de Melo POS, Almeida VAF, Loureiro AAF. Can complex network metrics predict the behavior of NBA teams? In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining [Internet]. New York, NY, USA: Association for Computing Machinery; 2008. p. 695–703. (KDD ’08).
2.
Khan E. Advanced NBA stats for dummies: How to understand the new hoops math [Internet]. https://bleacherreport.com/articles/1813902-advanced-nba-stats-for-dummies-how-to-understand-the-new-hoops-math; 2013. Accessed: 2021-5-17
3.
Usage percentage [Internet]. https://www.sportingcharts.com/dictionary/nba/usage-percentage.aspx; Accessed: 2021-5-16
4.
What are the best metrics to evaluate basketball players for your fantasy team? - Quora [Internet]. https://www.quora.com/What-are-the-best-metrics-to-evaluate-basketball-players-for-your-fantasy-team; Accessed: 2021-5-8
5.
Fromal A. Understanding the NBA: Explaining advanced offensive stats and metrics [Internet]. https://bleacherreport.com/articles/1039116-understanding-the-nba-explaining-advanced-offensive-stats-and-metrics; 2012. Accessed: 2021-5-11
6.
JoBS: Roboscout and the four factors of basketball success [Internet]. http://www.rawbw.com/~deano/articles/20040601_roboscout.htm; Accessed: 2021-5-19
7.
Regression to the mean [Internet]. https://www.nbastuffer.com/analytics101/regression-to-the-mean/; 2017. Accessed: 2021-5-19
8.
Schuhmann J. Power rankings notebook: How James Harden trade impacts all 4 teams [Internet]. https://www.nba.com/news/power-rankings-notebook-week-4; NBA.com; 2021. Accessed: 2021-5-8
9.
10.
Wikipedia contributors. Efficiency (basketball) [Internet]. https://en.wikipedia.org/w/index.php?title=Efficiency_(basketball)&oldid=1019192874; 2021. Accessed: 2021-5-8
11.
Trade value [Internet]. https://www.nbastuffer.com/analytics101/trade-value/; 2017. Accessed: 2021-5-9
12.
Trade value [Internet]. https://www.nbastuffer.com/analytics101/trade-value/; 2017. Accessed: 2021-5-7
13.
Approximate value (AV) [Internet]. https://www.nbastuffer.com/analytics101/approximate-value/; 2017. Accessed: 2021-5-9
14.
2018-19 NBA player stats: totals [Internet]. https://www.basketball-reference.com/leagues/NBA_2019_totals.html; Accessed: 2021-4-21
15.
NBA salaries [Internet]. https://hoopshype.com/salaries/; HoopsHype; Accessed: 2021-4-21
16.
NBA salaries [Internet]. https://hoopshype.com/salaries/; HoopsHype; Accessed: 2021-4-21
17.
2018-19 NBA season summary [Internet]. https://www.basketball-reference.com/leagues/NBA_2019.html; Accessed: 2021-4-21
18.
Stephanie. Variance inflation factor [Internet]. https://www.statisticshowto.com/variance-inflation-factor/; 2015. Accessed: 2021-5-19
19.
State of analytics: How the ‘idiots who believe’ in the movement have forever changed basketball - Stats Perform [Internet]. https://www.statsperform.com/resource/state-of-analytics-how-the-idiots-who-believe-in-the-movement-have-forever-changed-basketball/; 2021. Accessed: 2021-5-17
20.
Usage rate [Internet]. https://www.nbastuffer.com/analytics101/usage-rate/; 2017. Accessed: 2021-5-19
21.
Mara J. Data analysis in sport - 9531 [Internet]. https://uclearn.canberra.edu.au/courses/9531/modules; 2021. Accessed: 2021-5-16